Assessing the practical usability of an automatically annotated corpus

Authors

Md. Faisal Mahbub Chowdhury

Alberto Lavelli

Abstract

Creating a gold standard corpus (GSC) is a laborious and costly process. Silver standard corpus (SSC) annotation is a recent direction in corpus development that relies on multiple systems instead of human annotators. In this paper, we investigate the practical usability of an SSC when a machine learning system is trained on it and tested on an unseen benchmark GSC. The main focus of this paper is how an SSC can be maximally exploited. In the process, we examine several hypotheses that might have influenced the idea of SSC creation. Empirical results suggest that some of these hypotheses (e.g. that a large SSC has a positive impact despite containing wrong and missing annotations) are not fully correct. We show that it is possible to automatically improve the quality and the quantity of the SSC annotations. We also observe that using only those sentences of the SSC that contain annotations, rather than the full SSC, results in a performance boost.
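
The last observation in the abstract, training only on SSC sentences that carry at least one annotation, can be illustrated with a minimal sketch. The `Sentence` class, the span representation, and the toy data below are assumptions made for illustration; they are not taken from the paper and do not reflect the authors' actual data format or pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    """Hypothetical container for one SSC sentence."""
    text: str
    # Each annotation is assumed to be a (start, end, label) character span.
    annotations: list = field(default_factory=list)


def keep_annotated(sentences):
    """Return only the sentences that have at least one annotation."""
    return [s for s in sentences if s.annotations]


# Toy data, invented purely to exercise the filter.
ssc = [
    Sentence("BRCA1 is linked to breast cancer.", [(0, 5, "GENE")]),
    Sentence("The study ran for two years."),
]

filtered = keep_annotated(ssc)
print(len(filtered))
```

The filtered list would then serve as training data in place of the full SSC; how the downstream learner consumes it is outside the scope of this sketch.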

Similar resources

HunOr: A Hungarian-Russian Parallel Corpus

In this paper, we present HunOr, the first multi-domain Hungarian–Russian parallel corpus. Some of the corpus texts have been manually aligned and split into sentences, and named entities have also been annotated in them, while the other parts are automatically aligned at the sentence level and POS-tagged. The corpus contains texts from the domains literature, official language use...

Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora

Large and open multiparallel corpora are a valuable resource for contrastive corpus linguists if the data is annotated and stored in a way that allows precise and flexible ad hoc searches. A linguistic query language should also support computational linguists in automated multilingual data mining. We review a broad range of approaches for linguistic query and reporting languages according to u...

Towards Creation of a Corpus for Argumentation Mining the Biomedical Genetics Research Literature

Argumentation mining involves automatically identifying the premises, conclusion, and type of each argument as well as relationships between pairs of arguments in a document. We describe our plan to create a corpus from the biomedical genetics research literature, annotated to support argumentation mining research. We discuss the argumentation elements to be annotated, theoretical challenges, a...

A Debug Tool for Practical Grammar Development

We have developed willex, a tool that helps grammar developers to work efficiently by using annotated corpora and recording parsing errors. Willex has two major new functions. First, it decreases ambiguity of the parsing results by comparing them to an annotated corpus and removing wrong partial results both automatically and manually. Second, willex accumulates parsing errors as data for the d...

The Sense Boundary Decision and the Sense Labeling from Collocation Clustering

This paper discusses how to decide the practical sense boundaries of homonymous words. One of the serious problems in making dictionaries or thesauri is the vague boundary between senses. This also becomes a bottleneck in sense disambiguation for practical language processing systems. This paper proposes a method for discovering the sense boundaries of homonyms using collocation from large corpora and ...

Publication date: 2011
